Weakly supervised learning for multi-organ adenocarcinoma classification in whole slide images

Primary screening by automated computational pathology algorithms for the presence or absence of adenocarcinoma in biopsy specimens (e.g., endoscopic biopsy, transbronchial lung biopsy, and needle biopsy) of possible primary organs (e.g., stomach, colon, lung, and breast) and in radical lymph node dissection specimens would be a powerful tool to assist surgical pathologists in the routine histopathological diagnostic workflow. In this paper, we trained multi-organ deep learning models to classify adenocarcinoma in whole slide images (WSIs) of biopsy and radical lymph node dissection specimens. We evaluated the models on five independent test sets (stomach, colon, lung, breast, and lymph nodes) to demonstrate feasibility in multi-organ and lymph node specimens from different medical institutions, achieving receiver operating characteristic areas under the curve (ROC-AUCs) in the range of 0.91-0.98.


Introduction
Adenocarcinoma is a type of carcinoma that has the propensity to differentiate into glandular, ductal, and acinar cells in several organs (e.g., stomach, colon, lung, and breast). According to the Global Cancer Statistics 2020 [1], the numbers of new deaths (and percentages of deaths across all sites) for stomach, colon, lung, and breast cancers were as follows: 768,793 (7.7%) for stomach, 576,858 (5.8%) for colon, 1,796,144 (18.0%) for lung, and 684,996 (6.9%) for breast. Adenocarcinoma is the most common type of cancer affecting these four organs, so adenocarcinoma classification in the primary organs, especially on biopsy specimens, is one of the most important histopathological inspections in the clinical workflow for determining cancer treatment strategies. Moreover, lymph nodes are the most common site of metastatic adenocarcinoma, and lymph node metastasis can constitute the first clinical manifestation of the cancer. An important task of the surgical pathologist is to identify the presence or absence of a malignant process in the lymph node. If cancer cells are identified within the efferent lymph vessels and extra-nodal tissues, this must be noted in the pathological report because of its possible prognostic significance. Histopathological evaluation of lymph node metastasis is very important for staging of tumors, documentation of tumor recurrence, and prediction of the most probable primary site for a metastatic cancer of uncertain primary site. However, in routine diagnostic practice, a single glass slide frequently contains numerous lymph nodes to be inspected, and a single patient often has multiple radical lymph node dissection specimen glass slides, which imposes a considerable workload on surgical pathologists.
The incorporation of deep learning models into the routine histopathological diagnostic workflow is on the horizon and is a promising technology, with the potential to reduce the burden of time-consuming diagnosis and to increase the detection rate of anomalies, including cancers. Deep learning has been widely applied in tissue classification and adenocarcinoma detection on whole-slide images (WSIs), cellular detection and segmentation, and the stratification of patient outcomes [2][3][4][5][6][7][8][9][10][11][12][13][14][15]. Previous works have applied deep learning models to adenocarcinoma classification separately for different organs, such as stomach [15][16][17], colon [15,18], lung [16,19], and breast [20,21] histopathological specimen WSIs. Although these existing models exhibited very high ROC-AUCs for each organ, they cannot classify adenocarcinoma across organs accurately.
In this study, we trained deep learning models using weakly-supervised learning to predict adenocarcinoma in WSIs of stomach, colon, lung, and breast biopsy specimens for primary tumors, as well as in radical lymph node dissection specimens for metastatic carcinoma. The training datasets consisted of stomach, colon, lung, and breast biopsy specimen WSIs without annotations. We evaluated the models on each primary organ biopsy specimen (stomach, colon, lung, and breast) as well as on radical lymph node dissection specimens, assessing the presence or absence of metastatic adenocarcinoma, and achieved ROC-AUCs from 0.91 to 0.98. Our results suggest that deep learning algorithms might be useful as histopathological diagnostic aids for adenocarcinoma classification in primary organs and for lymph node metastatic cancer screening.

Clinical cases and pathological records
In the present retrospective study, a total of 8,896 H&E (hematoxylin & eosin) stained histopathological specimen slides of human adenocarcinoma and non-adenocarcinoma (adenoma and non-neoplastic) were collected from the surgical pathology files of five hospitals: International University of Health and Welfare (IUHW), Mita Hospital (Tokyo, Japan) and Kamachi Group Hospitals (total four hospitals: Wajiro, Shinkuki, Shinkomonji, and Shinmizumaki Hospital) (Fukuoka, Japan) after histopathological review by surgical pathologists. Adenoma cases were included because adenoma is a common differential diagnosis and exhibits some similarities to adenocarcinoma. The histopathological specimens were selected randomly to reflect real clinical settings as much as possible. Prior to the experimental procedures, each WSI diagnosis was reviewed by at least two pathologists, with final checking and verification performed by senior pathologists. All WSIs were scanned at a magnification of x20 using the same Leica Aperio AT2 Digital Whole Slide Scanner (Leica Biosystems, Tokyo, Japan) and were saved in SVS file format with JPEG2000 compression.

Dataset
Hospitals which provided histopathological specimen slides were anonymised by randomly assigning a letter (e.g., Hospital-A, B, C, D, and E). Table 1 breaks down the distribution of training sets from four domestic hospitals (Hospital-A, B, C, and D). Table 2 shows the distribution of the 1K (1,000 WSIs), 2K (2,000 WSIs), and 4K (4,000 WSIs) training sets. Validation sets were selected randomly from the training sets, and the numbers of validation WSIs are given in parentheses (Table 2). The distribution of test sets from five domestic hospitals (Hospital-A, B, C, D, and E) is summarized in Table 3. In both training and test sets, stomach, colon, lung, and breast WSIs solely consisted of biopsy (stomach and colon: endoscopic biopsy, lung: transbronchial lung biopsy (TBLB), breast: needle biopsy) specimens and lymph node WSIs consisted of radical dissection specimens (Tables 1-3). The distribution of lymph nodes in the test sets is summarized in Table 4.
None of the training set WSIs were manually annotated, and the training algorithm only used the WSI labels, which were extracted from the histopathological diagnostic reports after review by surgical pathologists. This means that the only information available for training was whether the WSI contained adenocarcinoma or non-adenocarcinoma; no information was available about the location of the cancerous lesions.

Deep learning models
In this study, we used EfficientNetB1 [22] as the architecture of our models. We observed no further improvements from using larger models. We used the partial fine-tuning approach [23] to train them. This method starts with an existing model pre-trained on ImageNet and fine-tunes only the affine parameters of the batch normalization layers and the final classification layer, while leaving the remaining weights frozen. Fig 1 shows an overview of the training method.
As we only had WSI labels, we used a weakly supervised method to train the models. The training method is similar to the one described in [24].
WSIs typically have large areas of white background that are not required for training the model and can easily be eliminated with preprocessing via thresholding using Otsu's method [25]. This creates a mask of the tissue regions, from which tiles can then be sampled in real time using the OpenSlide library [26] by providing coordinates from the tissue regions.
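The tissue-masking step can be illustrated with a minimal, dependency-free sketch. It assumes an 8-bit grayscale thumbnail given as nested lists, and that tissue is darker than the near-white background. The function names, the `min_tissue` fraction, and the tile-selection rule are illustrative assumptions, not details taken from the paper.

```python
def otsu_threshold(pixels):
    """Return the 8-bit level maximising between-class variance (class 0: value <= level)."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    weighted_total = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    count_dark, sum_dark = 0, 0
    for level in range(256):
        count_dark += hist[level]
        sum_dark += level * hist[level]
        count_bright = total - count_dark
        if count_dark == 0 or count_bright == 0:
            continue  # one class is empty, so the variance is undefined
        mean_dark = sum_dark / count_dark
        mean_bright = (weighted_total - sum_dark) / count_bright
        variance = count_dark * count_bright * (mean_dark - mean_bright) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def tissue_mask(gray):
    """Mark pixels at or below the Otsu level as tissue (background is near white)."""
    level = otsu_threshold([v for row in gray for v in row])
    return [[v <= level for v in row] for row in gray]


def tissue_tile_origins(mask, tile, stride, min_tissue=0.5):
    """List (x, y) tile origins whose tissue fraction is at least min_tissue (assumed rule)."""
    height, width = len(mask), len(mask[0])
    origins = []
    for y in range(0, height - tile + 1, stride):
        for x in range(0, width - tile + 1, stride):
            covered = sum(mask[y + dy][x + dx] for dy in range(tile) for dx in range(tile))
            if covered / (tile * tile) >= min_tissue:
                origins.append((x, y))
    return origins
```

In practice, the resulting origins would be rescaled to full-resolution coordinates and passed to a slide reader such as OpenSlide's `read_region`; that step is omitted here.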
For a given WSI, we obtained a single slide-level prediction using the following approach: we divided the WSI into a grid with a fixed stride and applied the model in a sliding window fashion over the grid, resulting in predictions for the entire tissue region. We then took the maximum probability over all the tiles and used it as the slide-level probability of the WSI containing adenocarcinoma (ADC). During training, we initially performed balanced random sampling of tiles from the tissue regions for the first two epochs; that is, we alternated between a positive WSI and a negative WSI, selecting an equal number of tiles from each. After the second epoch, we switched to hard mining of tiles: we still alternated between a positive WSI and a negative WSI, but now performed sliding window inference on the entire tissue region and selected the top k tiles with the highest probabilities of being positive. If the WSI is negative, this effectively selects the tiles most likely to be false positives. The selected tiles were placed in a training subset, and once that subset contained N tiles, a training pass was run to update the model weights. We used k = 8, N = 256, and a batch size of 32.
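The slide-level aggregation and hard-mining selection can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `predict_tiles` is a hypothetical callable standing in for sliding window inference, and the slide iterator is assumed to already alternate between positive and negative WSIs.

```python
def grid_origins(width, height, tile_size, stride):
    """Top-left corners of a fixed-stride sliding window over a region."""
    return [(x, y)
            for y in range(0, height - tile_size + 1, stride)
            for x in range(0, width - tile_size + 1, stride)]


def slide_probability(tile_probs):
    """Slide-level score: the maximum tile probability (max pooling)."""
    return max(tile_probs)


def hardest_tiles(tile_probs, k=8):
    """Indices of the k tiles with the highest positive-class probability."""
    order = sorted(range(len(tile_probs)), key=lambda i: tile_probs[i], reverse=True)
    return order[:k]


def fill_training_subset(slides, predict_tiles, k=8, subset_size=256):
    """Collect (slide_id, tile_index, label) triples until the subset is full.

    slides: iterable of (slide_id, label), assumed to alternate positive
            and negative WSIs.
    predict_tiles: hypothetical callable returning tile probabilities for a slide.
    """
    subset = []
    for slide_id, label in slides:
        probs = predict_tiles(slide_id)
        for index in hardest_tiles(probs, k):
            # On a negative slide these are the most likely false positives.
            subset.append((slide_id, index, label))
        if len(subset) >= subset_size:
            break
    return subset[:subset_size]
```

With k = 8 and N = 256, and assuming every WSI yields at least 8 tiles, a subset fills after 32 WSIs and is consumed in 8 batches of 32 tiles.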
In addition, during training, we performed data augmentation of the images by performing random shifts in brightness, contrast, hue and saturation, and rotation angles as well as horizontal and vertical flipping.
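As a rough illustration of these augmentations, the sketch below applies random flips, rotations, and colour jitter to a small RGB tile represented as nested lists. The jitter ranges and the restriction to rotations by multiples of 90 degrees are assumptions made for this example; the paper does not state the ranges used.

```python
import colorsys
import random


def _clip(value):
    """Clamp a float to the [0, 1] range."""
    return min(1.0, max(0.0, value))


def jitter_pixel(rgb, hue_shift, sat_scale, brightness_shift, contrast_scale):
    """Apply contrast, brightness, hue, and saturation changes to one 8-bit RGB pixel."""
    channels = [_clip((c / 255.0 - 0.5) * contrast_scale + 0.5 + brightness_shift)
                for c in rgb]
    h, s, v = colorsys.rgb_to_hsv(*channels)
    r, g, b = colorsys.hsv_to_rgb((h + hue_shift) % 1.0, _clip(s * sat_scale), v)
    return (round(r * 255), round(g * 255), round(b * 255))


def augment_tile(tile, rng):
    """tile: square list of rows of (R, G, B) tuples; returns an augmented copy."""
    if rng.random() < 0.5:                 # horizontal flip
        tile = [row[::-1] for row in tile]
    if rng.random() < 0.5:                 # vertical flip
        tile = tile[::-1]
    for _ in range(rng.randrange(4)):      # rotate clockwise by 0, 90, 180, or 270 degrees
        tile = [list(row) for row in zip(*tile[::-1])]
    params = (rng.uniform(-0.02, 0.02),    # hue shift (assumed range)
              rng.uniform(0.9, 1.1),       # saturation scale (assumed range)
              rng.uniform(-0.1, 0.1),      # brightness shift (assumed range)
              rng.uniform(0.9, 1.1))       # contrast scale (assumed range)
    return [[jitter_pixel(p, *params) for p in row] for row in tile]
```

The same jitter parameters are applied to every pixel in a tile, so each augmented tile keeps a consistent colour shift.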
We optimised the model weights by minimising the binary cross-entropy loss using the Adam optimization algorithm [27] with the following parameters: beta 1 = 0.9, beta 2 = 0.999 and a learning rate of 0.001. We applied a learning rate decay of 0.95 every 2 epochs. We used early stopping by tracking the performance of the model on a validation set; this allows stopping the training when no improvement was observed for more than 10 epochs. The model with the lowest validation loss was chosen as the final model.
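The learning rate schedule and early stopping rule can be sketched as below. Two assumptions are made for this example: the decay is applied as a staircase (the rate is multiplied by 0.95 once every 2 epochs), and "no improvement" means the validation loss did not fall below its best value so far. The class and function names are illustrative.

```python
def learning_rate(epoch, base_lr=0.001, decay=0.95, every=2):
    """Staircase schedule: multiply the base rate by `decay` once per `every` epochs."""
    return base_lr * decay ** (epoch // every)


class EarlyStopping:
    """Track the best validation loss and signal when patience is exhausted."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def update(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch  # checkpoint to keep as the final model
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        # Stop once more than `patience` consecutive epochs bring no improvement.
        return self.epochs_without_improvement > self.patience
```

A training loop would call `update` after each validation pass and restore the weights saved at `best_epoch` when it returns True.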

Software and statistical analysis
The deep learning models were implemented and trained using TensorFlow [28]. AUCs were calculated in Python using the scikit-learn package [29] and plotted using matplotlib [30]. The 95% CIs of the AUCs were estimated using the bootstrap method [31] with 1000 iterations. The true positive rate (TPR) was computed as

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

and the false positive rate (FPR) was computed as

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$

where TP, FP, TN, and FN represent true positive, false positive, true negative, and false negative, respectively. The ROC curve was computed by varying the probability threshold from 0.0 to 1.0 and computing both the TPR and FPR at the given threshold.
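For illustration, the sketch below computes a single ROC point, the AUC, and a bootstrap confidence interval using only the Python standard library. It is a stand-in for the scikit-learn calls used in the study, not a reproduction of them. The percentile form of the bootstrap and the rule of skipping resamples that contain only one class are assumptions.

```python
import random


def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)


def auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC, resampling slides with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < iterations:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:  # AUC is undefined without both classes
            continue
        estimates.append(auc(sample_labels, [scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * iterations)]
    upper = estimates[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lower, upper
```

The pairwise AUC is quadratic in the number of slides, which is adequate for a sketch; a rank-based computation would be preferable for large test sets.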

Compliance with ethical standards
The experimental protocol was approved by the ethical board of International University of Health and Welfare (No. 19-Im-007) and Kamachi Group Hospitals (No. 173). All research activities complied with all relevant ethical regulations and were performed in accordance with relevant guidelines and regulations in all the hospitals mentioned above.

Availability of data and material
The datasets generated during and/or analysed during the current study are not publicly available due to specific institutional requirements governing privacy protection but are available from the corresponding author on reasonable request. The datasets that support the findings of this study are available from International University of Health and Welfare, Mita Hospital (Tokyo, Japan) and Kamachi Group Hospitals (Fukuoka, Japan), but restrictions apply to the availability of these data, which were used under a data use agreement which was made according to the Ethical Guidelines for Medical and Health Research Involving Human Subjects as set by the Japanese Ministry of Health, Labour and Welfare, and so are not publicly available. The data contains potentially sensitive information. However, the data are available from the authors upon reasonable request for private viewing and with permission from the corresponding medical institutions within the terms of the data use agreement and if compliant with the ethical and legal requirements as stipulated by the Japanese Ministry of Health, Labour and Welfare.

Insufficient AUC performance of WSI adenocarcinoma evaluation using existing stomach adenocarcinoma classification model
Prior to training the multi-organ adenocarcinoma models, we evaluated the ROC-AUC performance of the existing stomach adenocarcinoma classification model [15] on the test sets (Table 3). Table 5 and Fig 2A show that the model achieved high ROC-AUC and low log loss values on stomach and colon endoscopic biopsy WSIs, but not on lung TBLB, breast needle biopsy, and radical lymph node dissection WSIs. We therefore trained new models using training sets with different numbers of WSIs (Table 2).

True positive adenocarcinoma prediction of radical lymph node dissection (lymphadenectomy) WSIs
A lymphadenectomy (radical lymph node dissection) is a surgical procedure to evaluate evidence of metastatic cancer. In routine histopathological diagnosis, the inspection of lymph nodes is a very important but time-consuming task for avoiding medical oversight. Therefore, in clinical settings, the multi-organ adenocarcinoma model is more useful when performing histopathological diagnosis of lymphadenectomy specimen WSIs. Our model (WS-4K: 224, x10 EfficientNetB1) correctly predicted metastatic lung adenocarcinoma (Fig 4A-4D) and breast invasive ductal carcinoma (Fig 4E-4J). The heatmap images showed true negative predictions (Fig 4B) of internal non-neoplastic lymph nodes (Fig 4A). Importantly, adenocarcinoma localization areas in both metastatic lung adenocarcinoma (Fig 4C) and breast invasive ductal carcinoma (Fig 4G) were positively predicted in the heatmap images (Fig 4D and 4H).

PLOS ONE

True negative adenocarcinoma prediction of radical lymph node dissection (lymphadenectomy) WSIs
Our model (WS-4K: 224, x10 EfficientNetB1) showed true negative predictions of metastatic adenocarcinoma in lymph nodes without evidence of cancer metastasis (Fig 5). In Fig 5A, there were a number of lymph nodes with a broad range of sizes (small to large) and shapes (round to irregular), none of which were predicted as metastatic lymph nodes (Fig 5B). Moreover, in Fig 5C, the lymph node was enlarged due to lymphadenitis (Fig 5E) but showed no evidence of metastatic adenocarcinoma, and it was not predicted as a metastatic lymph node (Fig 5D).

False positive adenocarcinoma prediction of radical lymph node dissection (lymphadenectomy) WSIs
Histopathologically, Fig 6A shows   6B, 6D and 6F). These tissue areas (Fig 6C and 6E) showed dense hematoxylic artifacts induced by crushing during specimen handling, which could be the primary cause of the false positives because of their morphological similarity to the irregularly shaped, dense nuclei of adenocarcinoma cells.

False negative adenocarcinoma prediction of radical lymph node dissection (lymphadenectomy) WSIs
In Fig 7A, histopathologically, only two metastatic colon adenocarcinoma foci were observed in the left-most lymph node (Fig 7C). Double-checking by two independent pathologists confirmed that there were no further metastatic adenocarcinoma cells in Fig 7A. However, the heatmap image did not predict any adenocarcinoma cells (Fig 7B).

Discussion
In the present study, we trained multi-organ deep learning models for the classification of adenocarcinoma in WSIs using weakly-supervised learning. The models were trained on WSIs obtained from four medical institutions and were then applied to multi-organ test sets obtained from five medical institutions to demonstrate the generalisation of the model on unseen data. The deep learning model (WS-4K: 224, x10 EfficientNetB1) achieved ROC-AUCs in the range of 0.91-0.98. So far, we have been investigating adenocarcinoma classification on histopathological WSIs in diverse organs (e.g., stomach [15][16][17], colon [15,18], lung [19,24], and breast [20,21]). These models are specific to each organ, and versatile adenocarcinoma histopathological classification models that can be applied across multiple organs have not been developed to date. A multi-organ adenocarcinoma classification model may play a key role in first-screening processes in routine pathological diagnosis in clinical laboratories, especially for radical lymph node dissection specimens, in which a single WSI contains a large number of lymph nodes.
Prior to training, we evaluated the versatility of the existing models. For example, the existing stomach adenocarcinoma classification model [15] exhibited high ROC-AUC and low log loss for the stomach and colon endoscopic biopsy test sets, but not for the lung, breast, and lymph node test sets (Table 5). Therefore, we trained the deep learning models from scratch using the weakly-supervised learning approach in this study.
We collected histopathological H&E stained specimens from as many medical institutions as possible to ensure diversity of histopathological features and specimen quality in the training sets (Table 1). We did not include radical lymph node dissection specimens in the training sets because we wanted to train the model on the primary organs and then predict metastatic adenocarcinoma in lymph nodes. In all training sets (1K, 2K, and 4K), WSIs from each organ (stomach, colon, lung, and breast) were equally distributed (Table 2).
In this study, we showed that moderately sized training sets of 2,000 (2K) and 4,000 (4K) WSIs can be used to train deep learning models with weakly-supervised learning. We obtained high ROC-AUC performance on the primary organ (stomach, colon, lung, and breast) and radical lymph node dissection test sets, which is highly promising in terms of the generalisation performance of our models for classifying adenocarcinoma in multiple organs. Using the weakly-supervised learning method allowed us to train on our datasets and obtain high performance without manually performed annotations. This suggests that it may be possible to train high-performance models for other multi-organ cancer classification tasks without detailed cellular-level or rough annotations and without an extremely large number of WSIs. We have previously demonstrated the usefulness of the weakly-supervised learning approach for lung carcinoma classification [24]. Importantly, there was no significant difference in ROC-AUC and log loss results between the 2K and 4K training sets, meaning that a small training dataset (2,000 WSIs in total) was sufficient for adenocarcinoma classification in multiple organs.
Our model satisfactorily predicted adenocarcinoma areas not only in primary organs (stomach, colon, lung, and breast) (Fig 3) but also in radical lymph node dissection specimens (Fig 4). In routine histopathological diagnosis, inspecting lymph nodes for cancer metastasis is laborious because glass slides usually contain many lymph nodes with a wide variety of sizes and shapes. Our model can localise predicted adenocarcinoma invasion and visualise it as heatmap images (Fig 4), which would be a valuable tool for primary screening or double-checking in the clinical laboratory workflow. Importantly, our model can identify adenocarcinoma-free (non-metastatic) lymph nodes (Fig 5), as reflected in its high specificity (0.931) (Table 7). This is an important finding for applying our model in the clinical workflow.
This study is not without limitations. One limitation is the use of a single scanner type for the majority of collected cases. Another limitation is the presence of false positives and false negatives. The false positives seem to be primarily caused by dense hematoxylic artifacts induced by crushing during specimen handling, which are morphologically similar to adenocarcinoma cell clusters with irregularly shaped, dense nuclei (Fig 6). Another major limitation is that the models were not validated in independent cohorts from different institutions.